Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variables should be considered as “failure” and “0” represents “No failure”.
!pip install imbalanced-learn --user
# this will help in making the Python code more structured automatically (good coding practice)
!pip install black[jupyter] --quiet
# to work with dataframes
import pandas as pd
import numpy as np
from numpy import array
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
custom = {"axes.edgecolor": "purple", "grid.linestyle": "solid", "grid.color": "black"}
sns.set_style("dark", rc=custom)
# to split data into train and test
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn import metrics
# to build logstic regression model
from sklearn.linear_model import LogisticRegression
# to create k folds of data and get cross validation score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# To impute missing values
from sklearn.impute import SimpleImputer
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# to impute missing values
from sklearn.impute import KNNImputer
# To build a decision tree model
from sklearn.tree import DecisionTreeClassifier
# To help with model building
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
StackingClassifier,
)
!pip install xgboost
from xgboost import XGBClassifier
# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import (
classification_report,
confusion_matrix,
recall_score,
accuracy_score,
precision_score,
f1_score,
roc_auc_score,
ConfusionMatrixDisplay,
precision_recall_curve,
roc_curve,
make_scorer,
)
# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
# To build a Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# To tune a model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import PCA
from imblearn.pipeline import Pipeline as ImbPipeline
from scipy.stats import randint
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x) # to display percentages rounded off to 3 decimal places
# to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.10.1) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn) (1.25.2) Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn) (1.11.4) Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn) (1.2.2) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn) (1.4.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn) (3.5.0) Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.3) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.25.2) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.4)
# let colab access my google drive
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Loading the dataset - sheet_name parameter is used if there are multiple tabs in the excel file.
df_train = pd.read_csv("/content/drive/MyDrive/Python_Course/Project_6/Train.csv")
df_test = pd.read_csv("/content/drive/MyDrive/Python_Course/Project_6/Test.csv")
df_train.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
df_test.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613 | -3.820 | 2.202 | 1.300 | -1.185 | -4.496 | -1.836 | 4.723 | 1.206 | -0.342 | -5.123 | 1.017 | 4.819 | 3.269 | -2.984 | 1.387 | 2.032 | -0.512 | -1.023 | 7.339 | -2.242 | 0.155 | 2.054 | -2.772 | 1.851 | -1.789 | -0.277 | -1.255 | -3.833 | -1.505 | 1.587 | 2.291 | -5.411 | 0.870 | 0.574 | 4.157 | 1.428 | -10.511 | 0.455 | -1.448 | 0 |
| 1 | 0.390 | -0.512 | 0.527 | -2.577 | -1.017 | 2.235 | -0.441 | -4.406 | -0.333 | 1.967 | 1.797 | 0.410 | 0.638 | -1.390 | -1.883 | -5.018 | -3.827 | 2.418 | 1.762 | -3.242 | -3.193 | 1.857 | -1.708 | 0.633 | -0.588 | 0.084 | 3.014 | -0.182 | 0.224 | 0.865 | -1.782 | -2.475 | 2.494 | 0.315 | 2.059 | 0.684 | -0.485 | 5.128 | 1.721 | -1.488 | 0 |
| 2 | -0.875 | -0.641 | 4.084 | -1.590 | 0.526 | -1.958 | -0.695 | 1.347 | -1.732 | 0.466 | -4.928 | 3.565 | -0.449 | -0.656 | -0.167 | -1.630 | 2.292 | 2.396 | 0.601 | 1.794 | -2.120 | 0.482 | -0.841 | 1.790 | 1.874 | 0.364 | -0.169 | -0.484 | -2.119 | -2.157 | 2.907 | -1.319 | -2.997 | 0.460 | 0.620 | 5.632 | 1.324 | -1.752 | 1.808 | 1.676 | 0 |
| 3 | 0.238 | 1.459 | 4.015 | 2.534 | 1.197 | -3.117 | -0.924 | 0.269 | 1.322 | 0.702 | -5.578 | -0.851 | 2.591 | 0.767 | -2.391 | -2.342 | 0.572 | -0.934 | 0.509 | 1.211 | -3.260 | 0.105 | -0.659 | 1.498 | 1.100 | 4.143 | -0.248 | -1.137 | -5.356 | -4.546 | 3.809 | 3.518 | -3.074 | -0.284 | 0.955 | 3.029 | -1.367 | -3.412 | 0.906 | -2.451 | 0 |
| 4 | 5.828 | 2.768 | -1.235 | 2.809 | -1.642 | -1.407 | 0.569 | 0.965 | 1.918 | -2.775 | -0.530 | 1.375 | -0.651 | -1.679 | -0.379 | -4.443 | 3.894 | -0.608 | 2.945 | 0.367 | -5.789 | 4.598 | 4.450 | 3.225 | 0.397 | 0.248 | -2.362 | 1.079 | -0.473 | 2.243 | -3.591 | 1.774 | -1.502 | -2.227 | 4.777 | -6.560 | -0.806 | -0.276 | -3.858 | -0.538 | 0 |
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 4995 non-null float64 1 V2 4994 non-null float64 2 V3 5000 non-null float64 3 V4 5000 non-null float64 4 V5 5000 non-null float64 5 V6 5000 non-null float64 6 V7 5000 non-null float64 7 V8 5000 non-null float64 8 V9 5000 non-null float64 9 V10 5000 non-null float64 10 V11 5000 non-null float64 11 V12 5000 non-null float64 12 V13 5000 non-null float64 13 V14 5000 non-null float64 14 V15 5000 non-null float64 15 V16 5000 non-null float64 16 V17 5000 non-null float64 17 V18 5000 non-null float64 18 V19 5000 non-null float64 19 V20 5000 non-null float64 20 V21 5000 non-null float64 21 V22 5000 non-null float64 22 V23 5000 non-null float64 23 V24 5000 non-null float64 24 V25 5000 non-null float64 25 V26 5000 non-null float64 26 V27 5000 non-null float64 27 V28 5000 non-null float64 28 V29 5000 non-null float64 29 V30 5000 non-null float64 30 V31 5000 non-null float64 31 V32 5000 non-null float64 32 V33 5000 non-null float64 33 V34 5000 non-null float64 34 V35 5000 non-null float64 35 V36 5000 non-null float64 36 V37 5000 non-null float64 37 V38 5000 non-null float64 38 V39 5000 non-null float64 39 V40 5000 non-null float64 40 Target 5000 non-null int64 dtypes: float64(40), int64(1) memory usage: 1.6 MB
All features are Dtype float except the Target which has a Dtype of integer.
# Copy data to avoid any changes to original date
train = df_train.copy()
test = df_test.copy()
# Use info() to print a concise summary of the DataFrame
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
train.duplicated().sum()
0
test.duplicated().sum()
0
train.isnull().sum()
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
round(train.isnull().sum()/train.isnull().count() * 100, 2)
V1 0.090 V2 0.090 V3 0.000 V4 0.000 V5 0.000 V6 0.000 V7 0.000 V8 0.000 V9 0.000 V10 0.000 V11 0.000 V12 0.000 V13 0.000 V14 0.000 V15 0.000 V16 0.000 V17 0.000 V18 0.000 V19 0.000 V20 0.000 V21 0.000 V22 0.000 V23 0.000 V24 0.000 V25 0.000 V26 0.000 V27 0.000 V28 0.000 V29 0.000 V30 0.000 V31 0.000 V32 0.000 V33 0.000 V34 0.000 V35 0.000 V36 0.000 V37 0.000 V38 0.000 V39 0.000 V40 0.000 Target 0.000 dtype: float64
# checking for unique values
train.nunique()
V1 19982 V2 19982 V3 20000 V4 20000 V5 20000 V6 20000 V7 20000 V8 20000 V9 20000 V10 20000 V11 20000 V12 20000 V13 20000 V14 20000 V15 20000 V16 20000 V17 20000 V18 20000 V19 20000 V20 20000 V21 20000 V22 20000 V23 20000 V24 20000 V25 20000 V26 20000 V27 20000 V28 20000 V29 20000 V30 20000 V31 20000 V32 20000 V33 20000 V34 20000 V35 20000 V36 20000 V37 20000 V38 20000 V39 20000 V40 20000 Target 2 dtype: int64
There are 36 missing values. Missing values are in V1 (18) and V2 (18).
V1 and V2 are missing .09% of their values.
All sensors except V1 and V2 have 20000 values in them.
test.isnull().sum()
V1 5 V2 6 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
round(test.isnull().sum()/test.isnull().count() * 100, 2)
V1 0.100 V2 0.120 V3 0.000 V4 0.000 V5 0.000 V6 0.000 V7 0.000 V8 0.000 V9 0.000 V10 0.000 V11 0.000 V12 0.000 V13 0.000 V14 0.000 V15 0.000 V16 0.000 V17 0.000 V18 0.000 V19 0.000 V20 0.000 V21 0.000 V22 0.000 V23 0.000 V24 0.000 V25 0.000 V26 0.000 V27 0.000 V28 0.000 V29 0.000 V30 0.000 V31 0.000 V32 0.000 V33 0.000 V34 0.000 V35 0.000 V36 0.000 V37 0.000 V38 0.000 V39 0.000 V40 0.000 Target 0.000 dtype: float64
# checking for unique values
test.nunique()
V1 4995 V2 4994 V3 5000 V4 5000 V5 5000 V6 5000 V7 5000 V8 5000 V9 5000 V10 5000 V11 5000 V12 5000 V13 5000 V14 5000 V15 5000 V16 5000 V17 5000 V18 5000 V19 5000 V20 5000 V21 5000 V22 5000 V23 5000 V24 5000 V25 5000 V26 5000 V27 5000 V28 5000 V29 5000 V30 5000 V31 5000 V32 5000 V33 5000 V34 5000 V35 5000 V36 5000 V37 5000 V38 5000 V39 5000 V40 5000 Target 2 dtype: int64
Observations:
V1 has 5 missing values .10% of their values and V2 has 6 missing values .12% of their values.
No other sensors are missing values.
train.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.000 | -0.272 | 3.442 | -11.876 | -2.737 | -0.748 | 1.840 | 15.493 |
| V2 | 19982.000 | 0.440 | 3.151 | -12.320 | -1.641 | 0.472 | 2.544 | 13.089 |
| V3 | 20000.000 | 2.485 | 3.389 | -10.708 | 0.207 | 2.256 | 4.566 | 17.091 |
| V4 | 20000.000 | -0.083 | 3.432 | -15.082 | -2.348 | -0.135 | 2.131 | 13.236 |
| V5 | 20000.000 | -0.054 | 2.105 | -8.603 | -1.536 | -0.102 | 1.340 | 8.134 |
| V6 | 20000.000 | -0.995 | 2.041 | -10.227 | -2.347 | -1.001 | 0.380 | 6.976 |
| V7 | 20000.000 | -0.879 | 1.762 | -7.950 | -2.031 | -0.917 | 0.224 | 8.006 |
| V8 | 20000.000 | -0.548 | 3.296 | -15.658 | -2.643 | -0.389 | 1.723 | 11.679 |
| V9 | 20000.000 | -0.017 | 2.161 | -8.596 | -1.495 | -0.068 | 1.409 | 8.138 |
| V10 | 20000.000 | -0.013 | 2.193 | -9.854 | -1.411 | 0.101 | 1.477 | 8.108 |
| V11 | 20000.000 | -1.895 | 3.124 | -14.832 | -3.922 | -1.921 | 0.119 | 11.826 |
| V12 | 20000.000 | 1.605 | 2.930 | -12.948 | -0.397 | 1.508 | 3.571 | 15.081 |
| V13 | 20000.000 | 1.580 | 2.875 | -13.228 | -0.224 | 1.637 | 3.460 | 15.420 |
| V14 | 20000.000 | -0.951 | 1.790 | -7.739 | -2.171 | -0.957 | 0.271 | 5.671 |
| V15 | 20000.000 | -2.415 | 3.355 | -16.417 | -4.415 | -2.383 | -0.359 | 12.246 |
| V16 | 20000.000 | -2.925 | 4.222 | -20.374 | -5.634 | -2.683 | -0.095 | 13.583 |
| V17 | 20000.000 | -0.134 | 3.345 | -14.091 | -2.216 | -0.015 | 2.069 | 16.756 |
| V18 | 20000.000 | 1.189 | 2.592 | -11.644 | -0.404 | 0.883 | 2.572 | 13.180 |
| V19 | 20000.000 | 1.182 | 3.397 | -13.492 | -1.050 | 1.279 | 3.493 | 13.238 |
| V20 | 20000.000 | 0.024 | 3.669 | -13.923 | -2.433 | 0.033 | 2.512 | 16.052 |
| V21 | 20000.000 | -3.611 | 3.568 | -17.956 | -5.930 | -3.533 | -1.266 | 13.840 |
| V22 | 20000.000 | 0.952 | 1.652 | -10.122 | -0.118 | 0.975 | 2.026 | 7.410 |
| V23 | 20000.000 | -0.366 | 4.032 | -14.866 | -3.099 | -0.262 | 2.452 | 14.459 |
| V24 | 20000.000 | 1.134 | 3.912 | -16.387 | -1.468 | 0.969 | 3.546 | 17.163 |
| V25 | 20000.000 | -0.002 | 2.017 | -8.228 | -1.365 | 0.025 | 1.397 | 8.223 |
| V26 | 20000.000 | 1.874 | 3.435 | -11.834 | -0.338 | 1.951 | 4.130 | 16.836 |
| V27 | 20000.000 | -0.612 | 4.369 | -14.905 | -3.652 | -0.885 | 2.189 | 17.560 |
| V28 | 20000.000 | -0.883 | 1.918 | -9.269 | -2.171 | -0.891 | 0.376 | 6.528 |
| V29 | 20000.000 | -0.986 | 2.684 | -12.579 | -2.787 | -1.176 | 0.630 | 10.722 |
| V30 | 20000.000 | -0.016 | 3.005 | -14.796 | -1.867 | 0.184 | 2.036 | 12.506 |
| V31 | 20000.000 | 0.487 | 3.461 | -13.723 | -1.818 | 0.490 | 2.731 | 17.255 |
| V32 | 20000.000 | 0.304 | 5.500 | -19.877 | -3.420 | 0.052 | 3.762 | 23.633 |
| V33 | 20000.000 | 0.050 | 3.575 | -16.898 | -2.243 | -0.066 | 2.255 | 16.692 |
| V34 | 20000.000 | -0.463 | 3.184 | -17.985 | -2.137 | -0.255 | 1.437 | 14.358 |
| V35 | 20000.000 | 2.230 | 2.937 | -15.350 | 0.336 | 2.099 | 4.064 | 15.291 |
| V36 | 20000.000 | 1.515 | 3.801 | -14.833 | -0.944 | 1.567 | 3.984 | 19.330 |
| V37 | 20000.000 | 0.011 | 1.788 | -5.478 | -1.256 | -0.128 | 1.176 | 7.467 |
| V38 | 20000.000 | -0.344 | 3.948 | -17.375 | -2.988 | -0.317 | 2.279 | 15.290 |
| V39 | 20000.000 | 0.891 | 1.753 | -6.439 | -0.272 | 0.919 | 2.058 | 7.760 |
| V40 | 20000.000 | -0.876 | 3.012 | -11.024 | -2.940 | -0.921 | 1.120 | 10.654 |
| Target | 20000.000 | 0.056 | 0.229 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Observations:
The max between the features ranges from 5.671 and 17.560, excluding Target feature which is either 0 or 1.
The mean between the features ranges from -3.611 to 2.485, excluding Target feature which is either 0 or 1.
The std between the features frange from 1.652 and 4.369, excluding Target feature which is either 0 or 1.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(data=train, x=feature, ax=ax_box2, showmeans=True, color="fuchsia", flierprops={'marker': 'o', 'markersize': 10, 'markerfacecolor': 'crimson', 'markeredgecolor':'purple'},
capprops={'color': 'teal'}, whiskerprops={'color': 'turquoise'}, medianprops ={'color' : 'blue'}
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=train, x=feature, kde=kde, ax=ax_hist2, bins=bins, stat='density', line_kws={'color': 'fuchsia','lw': 5, 'ls': ':'}
) if bins else sns.histplot(
data=train, x=feature, kde=kde, ax=ax_hist2, color="teal", ec="indigo"
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="crimson", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="purple", linestyle="-"
) # Add median to the histogram
ax_hist2.lines[0].set_color('fuchsia')
for feature in train.columns:
histogram_boxplot(train, feature, figsize=(12, 7), kde=True, bins=None)
Observations:
All the sensors are showing a normal distribution. There is a little bit of skewness on some of the sensors but not that much.
The target has a bias of zero.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 10))
else:
plt.figure(figsize=(n + 1, 10))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature, # Assign the x variable to hue
palette="magma", # Set the hue to the same variable
legend=False, # Disable the legend
order=data[feature].value_counts().index[:n].sort_values(),
)
# Annotate each bar with its count and percentage
for p in ax.patches:
prc = "{:.2f}%".format(100.0 * p.get_height() / total) # percentage
cnt = p.get_height() # count
xx = p.get_x() + p.get_width() / 2 # x coordinate of bar percentage label
yy = p.get_height() # y coordinate of bar percentage label
# Annotate percentage
ax.annotate(
prc,
(xx, yy),
ha="center",
va="center",
style="italic",
size=12,
xytext=(0, 10),
textcoords="offset points",
)
# Annotate count (adjust vertical position)
ax.annotate(
cnt,
(xx, yy + 100),
ha="center",
va="bottom", # Adjusted to display above the percentage label
size=12,
xytext=(0, 20),
textcoords="offset points",
)
# Increase y-axis size by 500
plt.ylim(0, ax.get_ylim()[1] + 500)
# shows percentage of Target split in train data
labeled_barplot(train, "Target")
# shows percentage of Target split in train data
labeled_barplot(test, "Target")
Observations:
Target for both test and train set has a 94% bias towards 0 or "not failure". Only 6% of both sets show 1 or "failure".
Heat Map Correlation Plot
plt.figure(figsize=(25, 15)) # sets size of heatmap
sns.heatmap(
data=train.corr(), annot=True, fmt=".1f", vmin=-1, vmax=1, cmap="viridis"
) # creates heatmap of correlation in the train data set
plt.show()
plt.figure(figsize=(25, 15)) # sets size of heatmap
sns.heatmap(
data=test.corr(), annot=True, fmt=".1f", vmin=-1, vmax=1, cmap="viridis"
) # creates heatmap of correlation in the train data set
plt.show()
Observations:
train1 = train.copy()
test1 = test.copy()
# prepare my data for the test train split. this includes the test dataset and the train dataset
X = train1.drop("Target", axis=1)
y = train1["Target"]
X_test = test1.drop("Target", axis=1)
y_test = test1["Target"]
# splits the data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.25, random_state=1, stratify=y
)
# prints shape of X_train, X_val, and X_test sets
print("X_train shape - ", X_train.shape)
print("X_val shape - ", X_val.shape)
print("X_test shape - ", X_test.shape)
X_train shape - (15000, 40) X_val shape - (5000, 40) X_test shape - (5000, 40)
Made a copy so there was a backup of the file.
Made adjustments to the copy. Removed the Target and split the data into test file and validation.
KNNImputer: Each sample's missing values are imputed by looking at the n_neighbors nearest neighbors found in the training set. Default value for n_neighbors=5.# sets the imputer as the KNNImputer using the 10 nearest neighbors to the data point.
imputer = KNNImputer(n_neighbors=10)
# normalizes data for KNNImputer to accuratly impute missing values
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.fit_transform(X_val)
X_test_scaled = scaler.fit_transform(X_test)
# prints missing values of X_train, X_val, and X_test sets
# 2 sensors had missing values V1 and V2
print(X_train.isna().sum()[:2], "\n", 50 * "-")
print(X_val.isna().sum()[:2], "\n", 50 * "-")
print(X_test.isna().sum()[:2], "\n", 50 * "-")
V1 15 V2 14 dtype: int64 -------------------------------------------------- V1 3 V2 4 dtype: int64 -------------------------------------------------- V1 5 V2 6 dtype: int64 --------------------------------------------------
# use the imputer to fit to X_train and then transform the missing values using KNNImputer
X_train_scaled = imputer.fit_transform(X_train_scaled)
# use the imputer to fit to X_val and then transform the missing values using KNNImputer
X_val_scaled = imputer.fit_transform(X_val_scaled)
# use the imputer to fit to X_test and then transform the missing values using KNNImputer
X_test_scaled = imputer.fit_transform(X_test_scaled)
# turns it from array back to a dataframe
X_train = pd.DataFrame(X_train_scaled, columns=X_train.columns)
# turns it from array back to a dataframe
X_val = pd.DataFrame(X_val_scaled, columns=X_val.columns)
# turns it from array back to a dataframe
X_test = pd.DataFrame(X_test_scaled, columns=X_test.columns)
# prints missing values of X_train, X_val, and X_test sets
# 2 sensors had missing values V1 and V2
print(X_train.isna().sum()[:2], "\n", 50 * "-")
print(X_val.isna().sum()[:2], "\n", 50 * "-")
print(X_test.isna().sum()[:2], "\n", 50 * "-")
V1 0 V2 0 dtype: int64 -------------------------------------------------- V1 0 V2 0 dtype: int64 -------------------------------------------------- V1 0 V2 0 dtype: int64 --------------------------------------------------
V1 and V2's missing values have now been updated. They were updated using KNN imputer with a nearest neighbors of 10.
The nature of predictions made by the classification model will translate as follows:
Which metric to optimize?
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="",cmap='cool')
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Let's start by building different models using KFold and cross_val_score and tune the best model using GridSearchCV and RandomizedSearchCV
Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. Split dataset into k consecutive folds (without shuffling by default) keeping the distribution of both classes in each fold the same as the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("GBoost", GradientBoostingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("DTree", DecisionTreeClassifier(random_state=1)))
models.append(("LogRegression", LogisticRegression(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
score = []
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
score.append(scores)
print("{}: {}".format(name, scores))
Cross-Validation Performance: Bagging: 72.1080730106053 Random Forest: 72.35192266070268 GBoost: 70.66661857008873 AdaBoost: 63.09140754635308 XGBoost: 81.60450183969411 DTree: 69.82829521679533 LogRegression: 45.06889834788256 Validation Performance: Bagging: 0.737410071942446 Random Forest: 0.7661870503597122 GBoost: 0.7517985611510791 AdaBoost: 0.6870503597122302 XGBoost: 0.8093525179856115 DTree: 0.7230215827338129 LogRegression: 0.46402877697841727
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Observations:
Top Order (based on training):
Top Order (based on validation):
Top 3 based on training and testing is XGBoost, Random Forest and GBoost.
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 832 Before OverSampling, counts of label '0': 14168 After OverSampling, counts of label '1': 14168 After OverSampling, counts of label '0': 14168 After OverSampling, the shape of train_X: (28336, 40) After OverSampling, the shape of train_y: (28336,)
models_over = [] # Empty list to store all the models
# Appending models into the list
models_over.append(("DTree_over", DecisionTreeClassifier(random_state=1)))
models_over.append(("GBoost_over", GradientBoostingClassifier(random_state=1)))
models_over.append(("AdaBoost_over", AdaBoostClassifier(random_state=1)))
models_over.append(("Bagging_over", BaggingClassifier(random_state=1)))
models_over.append(("Random_Forest_over", RandomForestClassifier(random_state=1)))
models_over.append(("LogRegression_over", LogisticRegression(random_state=1)))
models_over.append(("XGBoost_over", XGBClassifier(disable_default_eval_metric=True, random_state=1))
)
results2 = [] # Empty list to store all model's CV scores
names_over = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance on Oversampled Data:" "\n")
for name, model in models_over:
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results2.append(cv_result)
names_over.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance on Oversampled Data:" "\n")
for name, model in models_over:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Performance on Oversampled Data: DTree_over: 0.9705674452297639 GBoost_over: 0.9244069479551043 AdaBoost_over: 0.8963857261467018 Bagging_over: 0.9762138731419521 Random_Forest_over: 0.9818606747126131 LogRegression_over: 0.8845988440003278 XGBoost_over: 0.9893420895629467 Validation Performance on Oversampled Data: DTree_over: 0.6510791366906474 GBoost_over: 0.8381294964028777 AdaBoost_over: 0.8093525179856115 Bagging_over: 0.7266187050359713 Random_Forest_over: 0.7949640287769785 LogRegression_over: 0.8345323741007195 XGBoost_over: 0.8273381294964028
Observations:
Top Order (based on training):
Top Order (based on validation):
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 832 Before UnderSampling, counts of label '0': 14168 After UnderSampling, counts of label '1': 832 After UnderSampling, counts of label '0': 832 After UnderSampling, the shape of train_X: (1664, 40) After UnderSampling, the shape of train_y: (1664,)
models_un = [] # Empty list to store all the models
# Appending models into the list
models_un.append(("DTree_under", DecisionTreeClassifier(random_state=1)))
models_un.append(("GBoost_under", GradientBoostingClassifier(random_state=1)))
models_un.append(("AdaBoost_under", AdaBoostClassifier(random_state=1)))
models_un.append(("Bagging_under", BaggingClassifier(random_state=1)))
models_un.append(("Random_Forest_under", RandomForestClassifier(random_state=1)))
models_un.append(("LogRegression_under", LogisticRegression(random_state=1)))
models_un.append(
("XGBoost_under", XGBClassifier(disable_default_eval_metric=True, random_state=1))
)
results3 = [] # Empty list to store all model's CV scores
names_un = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance on Undersampled Data:" "\n")
for name, model in models_un:
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results3.append(cv_result)
names_un.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance on Undersampled Data:" "\n")
for name, model in models_un:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Performance on Undersampled Data: DTree_under: 0.8617776495202367 GBoost_under: 0.8990621167303946 AdaBoost_under: 0.8666113556020489 Bagging_under: 0.8641945025611427 Random_Forest_under: 0.9038669648654498 LogRegression_under: 0.8690065651828872 XGBoost_under: 0.9002741504941923 Validation Performance on Undersampled Data: DTree_under: 0.7805755395683454 GBoost_under: 0.8812949640287769 AdaBoost_under: 0.8381294964028777 Bagging_under: 0.8597122302158273 Random_Forest_under: 0.8812949640287769 LogRegression_under: 0.841726618705036 XGBoost_under: 0.8848920863309353
Observations:
Top Order (based on training performance):
Top Order (based on validation performance):
Top 3 are Random Forrest, XG Boost and GBoost in training and validation.
The recall scores are ok but they can be improved.
Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.
param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }
param_grid = {'C': np.arange(0.1,1.1,0.1)}
param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
Going to tune Gradient Boost, XG Boost, and Random Forest. These models had the highest validations on the original data, oversampling and undersampling.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {"n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.6984272419017385:
# building model with best parameters
Random_Forest_tuned = RandomForestClassifier(
n_estimators= 300,
min_samples_leaf= 1,
max_samples= 0.6,
max_features= 'sqrt',
)
# Fit the model on training data
Random_Forest_tuned.fit(X_train, y_train)
RandomForestClassifier(max_samples=0.6, n_estimators=300)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_samples=0.6, n_estimators=300)
## To check the performance on training set
Random_Forest_random_train = model_performance_classification_sklearn(
Random_Forest_tuned, X_train, y_train
)
Random_Forest_random_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9949 | 0.9087 | 1.0000 | 0.9521 |
confusion_matrix_sklearn(Random_Forest_tuned, X_train, y_train)
## To check the performance on validation set
Random_Forest_random_val = model_performance_classification_sklearn(
Random_Forest_tuned, X_val, y_val
)
Random_Forest_random_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9860 | 0.7626 | 0.9815 | 0.8583 |
confusion_matrix_sklearn(Random_Forest_tuned, X_val, y_val)
Training - There is overfitting precision is 100% and Accuracy is 99.49%.
Validation - Recall rate is 76.26 but all other rates are good. Recall can be improved.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {"n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling GriddSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'max_features': array([0.3, 0.4, 0.5]), 'max_samples': 0.4, 'min_samples_leaf': 1, 'n_estimators': 250} with CV score=0.8125532068393333:
# Define the array of max_features
max_features_array = [0.3, 0.4, 0.5]
# Loop through each value in the array and build a model
for max_feature in max_features_array:
Random_Forest_tuned_grid = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.4,
max_features=max_feature
)
# Fit the model on training data
Random_Forest_tuned_grid.fit(X_train, y_train)
RandomForestClassifier(max_features=0.5, max_samples=0.4, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_features=0.5, max_samples=0.4, n_estimators=250)
## To check the performance on training set
Random_Forest_grid_train = model_performance_classification_sklearn(
Random_Forest_tuned_grid, X_train, y_train
)
Random_Forest_grid_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9906 | 0.8341 | 0.9957 | 0.9078 |
confusion_matrix_sklearn(Random_Forest_tuned_grid, X_train, y_train)
## To check the performance on validation set
Random_Forest_grid_val = model_performance_classification_sklearn(
Random_Forest_tuned_grid, X_val, y_val
)
Random_Forest_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9860 | 0.7770 | 0.9643 | 0.8606 |
confusion_matrix_sklearn(Random_Forest_tuned_grid, X_val, y_val)
Training - There is overfitting precision is 99.57% and Accuracy is 99.06%.
Validation - Recall rate is 77.70 but all other rates are good. Recall can be improved. Accuracy is a little overfitted.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.7572685953394416:
# building model with best parameters
GB_tuned = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.5,
learning_rate= 0.2,
)
# Fit the model on training data
GB_tuned.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)## To check the performance on validation set
GB_random_train = model_performance_classification_sklearn(
GB_tuned, X_train, y_train
)
GB_random_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9927 | 0.9087 | 0.9570 | 0.9322 |
confusion_matrix_sklearn(GB_tuned, X_train, y_train)
## To check the performance on validation set
GB_random_val = model_performance_classification_sklearn(
GB_tuned, X_val, y_val
)
GB_random_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9734 | 0.7374 | 0.7736 | 0.7551 |
confusion_matrix_sklearn(GB_tuned, X_val, y_val)
Train -The Recall score is 90.87%, there is most likely overfitting since Accuracy is 99.27% and Precision is 95.7%.
Validation - Recall rate could be improved but is not bad, it is 73.74%. Accuracy which is 97.34%.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling GridCSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
#Fitting parameters in GridCSearchCV
grid_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_, grid_cv.best_score_))
Best parameters are {'learning_rate': 0.2, 'max_features': 0.5, 'n_estimators': 125, 'subsample': 0.7} with CV score=0.8017386912921147:
# building model with best parameters
GB_tuned_grid = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.5,
learning_rate= 0.2,
)
# Fit the model on training data
GB_tuned_grid.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)## To check the performance on training set
GB_grid_train = model_performance_classification_sklearn(
GB_tuned_grid, X_train, y_train
)
GB_grid_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9945 | 0.9147 | 0.9845 | 0.9483 |
confusion_matrix_sklearn(GB_tuned_grid, X_train, y_train)
## To check the performance on validation set
GB_grid_val = model_performance_classification_sklearn(
GB_tuned_grid, X_val, y_val
)
GB_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9722 | 0.7554 | 0.7473 | 0.7513 |
confusion_matrix_sklearn(GB_tuned_grid, X_val, y_val)
Train -The Recall score is 91.47%, there is most likely overfitting since Accuracy is 99.45% and Precision is 98.45%.
Validation - Recall rate could be improved but is not bad, it is 75.54%. Precision and F1 rates are lower at 74.73% and 75.13%. Accuracy has a rate of 97.22%.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.8582136930957361:
# building model with best parameters
XGB_tuned = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 10,
n_estimators= 200,
learning_rate= 0.1,
gamma= 3
)
# Fit the model on training data
XGB_tuned.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_random_train = model_performance_classification_sklearn(
XGB_tuned, X_train, y_train
)
XGB_random_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9993 | 1.0000 | 0.9870 | 0.9934 |
confusion_matrix_sklearn(XGB_tuned, X_train, y_train)
## To check the performance on validation set
XGB_random_val = model_performance_classification_sklearn(
XGB_tuned, X_val, y_val
)
XGB_random_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9872 | 0.8058 | 0.9573 | 0.8750 |
confusion_matrix_sklearn(XGB_tuned, X_val, y_val)
Train - Original Data is overfit. Everything ranges from 98.7% to 100%.
Validation - Rates are all good. Accuracy might have some overfitting but rest of the rates look good.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling GridCSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
#Fitting parameters in GridCSearchCV
grid_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_, grid_cv.best_score_))
Best parameters are {'gamma': 3, 'learning_rate': 0.1, 'n_estimators': 200, 'scale_pos_weight': 10, 'subsample': 0.9} with CV score=0.8582136930957361:
# building model with best parameters
XGB_tuned_grid = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 10,
n_estimators= 200,
learning_rate= 0.1,
gamma= 3
)
# Fit the model on training data
XGB_tuned_grid.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_grid_train = model_performance_classification_sklearn(
XGB_tuned_grid, X_train, y_train
)
XGB_grid_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9993 | 1.0000 | 0.9870 | 0.9934 |
confusion_matrix_sklearn(XGB_tuned_grid, X_train, y_train)
## To check the performance on validation set
XGB_grid_val = model_performance_classification_sklearn(
XGB_tuned_grid, X_val, y_val
)
XGB_grid_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9872 | 0.8058 | 0.9573 | 0.8750 |
confusion_matrix_sklearn(XGB_tuned_grid, X_val, y_val)
Train - Original Data is overfit. Everything ranges from 98.7% to 100%.
Validation - Rates are all good. Accuracy might have some overfitting but rest of the rates look good.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9808017764222002:
# building model with best parameters
Random_Forest_tuned_over = RandomForestClassifier(
n_estimators= 250,
min_samples_leaf= 1,
max_samples= 0.6,
max_features= 'sqrt',
)
# Fit the model on training data
Random_Forest_tuned_over.fit(X_train_over, y_train_over)
RandomForestClassifier(max_samples=0.6, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_samples=0.6, n_estimators=250)
## To check the performance on training set
Random_Forest_random_train_over = model_performance_classification_sklearn(
Random_Forest_tuned_over, X_train, y_train
)
Random_Forest_random_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9999 | 1.0000 | 0.9988 | 0.9994 |
confusion_matrix_sklearn(Random_Forest_tuned_over, X_train, y_train)
## To check the performance on validation set
Random_Forest_random_val_over = model_performance_classification_sklearn(
Random_Forest_tuned_over, X_val, y_val
)
Random_Forest_random_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9880 | 0.8094 | 0.9698 | 0.8824 |
confusion_matrix_sklearn(Random_Forest_tuned_over, X_val, y_val)
Train - Even more overfit than the original.
Validation - Not as overfit as the training but Accuracy and Precision are still showing overfitting.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid = { "n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
#Fitting parameters in GridCSearchCV
grid_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_, grid_cv.best_score_))
Best parameters are {'max_features': array([0.3, 0.4, 0.5]), 'max_samples': 0.4, 'min_samples_leaf': 1, 'n_estimators': 250} with CV score=0.8125532068393333:
# Define the array of max_features
max_features_array = [0.3, 0.4, 0.5]
# Loop through each value in the array and build a model
for max_feature in max_features_array:
Random_Forest_tuned_over_grid = RandomForestClassifier(
n_estimators=250,
min_samples_leaf=1,
max_samples=0.4,
max_features=max_feature
)
# Fit the model on training data
Random_Forest_tuned_over_grid.fit(X_train_over, y_train_over)
RandomForestClassifier(max_features=0.5, max_samples=0.4, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_features=0.5, max_samples=0.4, n_estimators=250)
## To check the performance on validation set
Random_Forest_grid_train_over = model_performance_classification_sklearn(
Random_Forest_tuned_over_grid, X_train, y_train
)
Random_Forest_grid_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9983 | 0.9988 | 0.9719 | 0.9852 |
confusion_matrix_sklearn(Random_Forest_tuned_over_grid, X_train, y_train)
## To check the performance on training set
Random_Forest_grid_val_over = model_performance_classification_sklearn(
Random_Forest_tuned_over_grid, X_val, y_val
)
Random_Forest_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9868 | 0.8237 | 0.9309 | 0.8740 |
confusion_matrix_sklearn(Random_Forest_tuned_over_grid, X_val, y_val)
Train - Even more overfit than the original.
Validation - Not as overfit as the training but Accuracy is still showing overfitting.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 1} with CV score=0.9698622520495789:
# building model with best parameters
GB_tuned_over = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.7,
learning_rate= 1,
)
# Fit the model on training data
GB_tuned_over.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=1, max_features=0.7, n_estimators=125,
subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=1, max_features=0.7, n_estimators=125,
subsample=0.7)## To check the performance on training set
GB_random_train_over = model_performance_classification_sklearn(
GB_tuned_over, X_train, y_train
)
GB_random_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9941 | 0.9976 | 0.9051 | 0.9491 |
confusion_matrix_sklearn(GB_tuned_over, X_train, y_train)
## To check the performance on validation set
GB_random_val_over = model_performance_classification_sklearn(
GB_tuned_over, X_val, y_val
)
GB_random_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9696 | 0.8022 | 0.6969 | 0.7458 |
confusion_matrix_sklearn(GB_tuned_over, X_val, y_val)
Train - There is overfitting in both Accuracy and Recall. Precision rates and F1 rates are good.
Validation - Recall is 80.22% is ok and the Precision has a rate of 69.69%.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best parameters are {'learning_rate': 0.2, 'max_features': 0.5, 'n_estimators': 125, 'subsample': 0.7} with CV score=0.8017386912921147:
# building model with best parameters
GB_tuned_over_grid = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.5,
learning_rate= 0.2,
)
# Fit the model on training data
GB_tuned_over_grid.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)## To check the performance on training set
GB_grid_train_over = model_performance_classification_sklearn(
GB_tuned_over_grid, X_train, y_train
)
GB_grid_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9855 | 0.9483 | 0.8185 | 0.8786 |
confusion_matrix_sklearn(GB_tuned_over_grid, X_train, y_train)
## To check the performance on validation set
GB_grid_val_over = model_performance_classification_sklearn(
GB_tuned_over_grid, X_val, y_val
)
GB_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9798 | 0.8381 | 0.8062 | 0.8219 |
confusion_matrix_sklearn(GB_tuned_over_grid, X_val, y_val)
Train - There is a little overfitting in Accuracy. The rest of the rates are good.
Validation - Recall is 83.81% which is ok and the Precision has a rate of 80.62%.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 150, 'learning_rate': 0.2, 'gamma': 0} with CV score=0.9954827929027807:
# building model with best parameters
XGB_tuned_over = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 10,
n_estimators= 150,
learning_rate= 0.2,
gamma= 0
)
# Fit the model on training data
XGB_tuned_over.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=0, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.2, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=150, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=0, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.2, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=150, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_random_train_over = model_performance_classification_sklearn(
XGB_tuned_over, X_train, y_train
)
XGB_random_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.995 | 0.998 |
confusion_matrix_sklearn(XGB_tuned_over, X_train, y_train)
## To check the performance on validation set
XGB_random_val_over = model_performance_classification_sklearn(
XGB_tuned_over, X_val, y_val
)
XGB_random_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.986 | 0.817 | 0.927 | 0.868 |
confusion_matrix_sklearn(XGB_tuned_over, X_val, y_val)
Train - Overfitting in all rates.
Validation - Accuracy is showing some overfitting the rest of the rates are good.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters are {'gamma': 3, 'learning_rate': 0.1, 'n_estimators': 200, 'scale_pos_weight': 10, 'subsample': 0.9} with CV score=0.8582136930957361:
# building model with best parameters
XGB_tuned_over_grid = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 10,
n_estimators= 200,
learning_rate= 0.1,
gamma= 3
)
# Fit the model on training data
XGB_tuned_over_grid.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_grid_train_over = model_performance_classification_sklearn(
XGB_tuned_over_grid, X_train, y_train
)
XGB_grid_train_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9926 | 1.0000 | 0.8823 | 0.9375 |
confusion_matrix_sklearn(XGB_tuned_over_grid, X_train, y_train)
## To check the performance on validation set
XGB_grid_val_over = model_performance_classification_sklearn(
XGB_tuned_over_grid, X_val, y_val
)
XGB_grid_val_over
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9838 | 0.8489 | 0.8582 | 0.8535 |
confusion_matrix_sklearn(XGB_tuned_over_grid, X_val, y_val)
Training - Accuracy and Recall are both overfitted. Other rates are good.
Validation - Accuracy showing some overfitting. Other rates are good.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8990188298102589:
# building model with best parameters
Random_Forest_tuned_under = RandomForestClassifier(
n_estimators= 300,
min_samples_leaf= 2,
max_samples= 0.5,
max_features= 'sqrt',
)
# Fit the model on training data
Random_Forest_tuned_under.fit(X_train_un,y_train_un)
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=300)
## To check the performance on training set
Random_Forest_random_train_under = model_performance_classification_sklearn(
Random_Forest_tuned_under, X_train, y_train
)
Random_Forest_random_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9460 | 0.9339 | 0.5072 | 0.6574 |
confusion_matrix_sklearn(Random_Forest_tuned_under, X_train, y_train)
## To check the performance on validation set
Random_Forest_random_val_under = model_performance_classification_sklearn(
Random_Forest_tuned_under, X_val, y_val
)
Random_Forest_random_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9440 | 0.8777 | 0.4980 | 0.6354 |
confusion_matrix_sklearn(Random_Forest_tuned_under, X_val, y_val)
Train - Recall rate is good but Precision is very low, only 50.72%.
Validation - Recall rate is good but Precision is very low, only 49.80%.
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid = { "n_estimators": [200,250,300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1) }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best parameters are {'max_features': array([0.3, 0.4, 0.5]), 'max_samples': 0.4, 'min_samples_leaf': 1, 'n_estimators': 250} with CV score=0.8125532068393333:
# building model with best parameters
Random_Forest_tuned_under_grid = RandomForestClassifier(
n_estimators= 250,
min_samples_leaf= 2,
max_samples= 0.5,
max_features= 'sqrt',
)
# Fit the model on training data
Random_Forest_tuned_under_grid.fit(X_train_un,y_train_un)
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=250)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=250)
## To check the performance on training set
Random_Forest_grid_train_under = model_performance_classification_sklearn(
Random_Forest_tuned_under_grid, X_train, y_train
)
Random_Forest_grid_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9478 | 0.9327 | 0.5163 | 0.6647 |
confusion_matrix_sklearn(Random_Forest_tuned_under_grid, X_train, y_train)
## To check the performance on validation set
Random_Forest_grid_val_under = model_performance_classification_sklearn(
Random_Forest_tuned_under_grid, X_val, y_val
)
Random_Forest_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9468 | 0.8777 | 0.5126 | 0.6472 |
confusion_matrix_sklearn(Random_Forest_tuned_under_grid, X_val, y_val)
Training - Recall is good at 93.27%, but Precision is low at 51.63%.
Validation - Recall is good at 87.77%, but Precision is low at 51.26%.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2} with CV score=0.9014284683644759:
# building model with best parameters
GB_tuned_under = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.5,
learning_rate= 0.2,
)
# Fit the model on training data
GB_tuned_under.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)## To check the performance on training set
GB_random_train_under = model_performance_classification_sklearn(
GB_tuned_under, X_train, y_train
)
GB_random_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9849 | 0.9543 | 0.8077 | 0.8749 |
confusion_matrix_sklearn(GB_tuned_under, X_train, y_train)
## To check the performance on validation set
GB_random_val_under = model_performance_classification_sklearn(
GB_tuned_under, X_val, y_val
)
GB_random_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9782 | 0.8345 | 0.7864 | 0.8098 |
confusion_matrix_sklearn(GB_tuned_under, X_val, y_val)
Train - All have decent rates. Recall rate is 95.43 and Precision is 80.77. Overfitting has been minimized.
Validation - All have decent rates. Recall rate is 83.45 and Precision is 78.64. Overfitting has been minimized.
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid = { "n_estimators": np.arange(100,150,25),
"learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7],
"max_features":[0.5,0.7] }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best parameters are {'learning_rate': 0.2, 'max_features': 0.5, 'n_estimators': 125, 'subsample': 0.7} with CV score=0.7572685953394416:
# building model with best parameters
GB_tuned_under_grid = GradientBoostingClassifier(
subsample= 0.7,
n_estimators= 125,
max_features= 0.5,
learning_rate= 0.2,
)
# Fit the model on training data
GB_tuned_under_grid.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, subsample=0.7)## To check the performance on training set
GB_grid_train_under = model_performance_classification_sklearn(
GB_tuned_under_grid, X_train, y_train
)
GB_grid_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9845 | 0.9567 | 0.8016 | 0.8723 |
confusion_matrix_sklearn(GB_tuned_under_grid, X_train, y_train)
## To check the performance on validation set
GB_grid_val_under = model_performance_classification_sklearn(
GB_tuned_under_grid, X_val, y_val
)
GB_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9804 | 0.8453 | 0.8103 | 0.8275 |
confusion_matrix_sklearn(GB_tuned_under_grid, X_val, y_val)
Training - Accuracy is a little overfit but the rest of the rates look good.
Validation - All rates are good. Recall rate is 84.53% and Precision is 81.03%.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=20, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 5, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9290743813577663:
# building model with best parameters
XGB_tuned_under = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 5,
n_estimators= 200,
learning_rate= 0.1,
gamma= 5
)
# Fit the model on training data
XGB_tuned_under.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=5, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=5, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_random_train_under = model_performance_classification_sklearn(
XGB_tuned_under, X_train, y_train
)
XGB_random_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.8759 | 0.9988 | 0.3088 | 0.4718 |
confusion_matrix_sklearn(XGB_tuned_under, X_train, y_train)
## To check the performance on validation set
XGB_random_val_under = model_performance_classification_sklearn(
XGB_tuned_under, X_val, y_val
)
XGB_random_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.8808 | 0.9065 | 0.3066 | 0.4582 |
confusion_matrix_sklearn(XGB_tuned_under, X_val, y_val)
Train - Recall is showing overfitting and the Precision rate is not acceptable at 30.88%.
Validation - Recall rate is good, however the Precision rate is not acceptable at 45.82%.
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in GridCSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
'scale_pos_weight': [5,10],
'learning_rate': [0.1,0.2],
'gamma': [0,3,5],
'subsample': [0.8,0.9] }
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1, verbose= 2)
#Fitting parameters in GridSearchCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters are {'gamma': 3, 'learning_rate': 0.1, 'n_estimators': 200, 'scale_pos_weight': 10, 'subsample': 0.9} with CV score=0.8582136930957361:
# building model with best parameters
XGB_tuned_under_grid = XGBClassifier(
subsample= 0.9,
scale_pos_weight = 10,
n_estimators= 200,
learning_rate= 0.1,
gamma= 3
)
# Fit the model on training data
XGB_tuned_under_grid.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)## To check the performance on training set
XGB_grid_train_under = model_performance_classification_sklearn(
XGB_tuned_under_grid, X_train, y_train
)
XGB_grid_train_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.8765 | 1.0000 | 0.3099 | 0.4731 |
confusion_matrix_sklearn(XGB_tuned_under_grid, X_train, y_train)
## To check the performance on validation set
XGB_grid_val_under = model_performance_classification_sklearn(
XGB_tuned_under_grid, X_val, y_val
)
XGB_grid_val_under
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.8872 | 0.9065 | 0.3190 | 0.4719 |
confusion_matrix_sklearn(XGB_tuned_under_grid, X_val, y_val)
Training - Recall is overfitted and Precision has a rate of 30.99% which is not good.
Validation - Recall rate is good, but the Precision rate is 31.90% which is not good at all.
# Set the display format to show 4 decimal places
pd.options.display.float_format = '{:.4f}'.format
# Validation performance comparison
models_train_comp_df = pd.concat(
[
Random_Forest_random_train.T,
Random_Forest_grid_train.T,
GB_random_train.T,
GB_grid_train.T,
XGB_random_train.T,
XGB_grid_train.T,
Random_Forest_random_train_over.T,
Random_Forest_grid_train_over.T,
GB_random_train_over.T,
GB_grid_train_over.T,
XGB_random_train_over.T,
XGB_grid_train_over.T,
Random_Forest_random_train_under.T,
Random_Forest_grid_train_under.T,
GB_random_train_under.T,
GB_grid_train_under.T,
XGB_random_train_under.T,
XGB_grid_train_under.T
],
axis=1,
)
models_train_comp_df.columns = [
"Random Forest Tuned with Random Search",
"Random Forest Tuned with Grid Search",
"Gradient Boost Tuned with Random Search",
"Gradient Boost Tuned with Grid Search",
"XG Boost Tuned with Random Search",
"XG Boost Tuned with Grid Search",
"Random Forest Tuned with Oversampling with Random Search",
"Random Forest Tuned with Oversampling with Grid Search",
"Gradient Boost Tuned with Oversampling with Random Search",
"Gradient Boost Tuned with Oversampling with Grid Search",
"XG Boost with Oversampling with Random Search",
"XG Boost with Oversampling with Grid Search",
"Random Forest Tuned with Undersampling with Random Search",
"Random Forest Tuned with Undersampling with Grid Search",
"Gradient Boost Tuned with Undersampling with Random Search",
"Gradient Boost Tuned with Undersampling with Grid Search",
"XG Booat with Undersampling with Random Search",
"XG Booat with Undersampling with Grid Search",
]
print("Training performance comparison:")
models_train_comp_df.transpose().round(4).style.set_table_styles([dict(selector="th", props=[('max-width', '150px')])])
Training performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Random Forest Tuned with Random Search | 0.994900 | 0.908700 | 1.000000 | 0.952100 |
| Random Forest Tuned with Grid Search | 0.990600 | 0.834100 | 0.995700 | 0.907800 |
| Gradient Boost Tuned with Random Search | 0.992700 | 0.908700 | 0.957000 | 0.932200 |
| Gradient Boost Tuned with Grid Search | 0.994500 | 0.914700 | 0.984500 | 0.948300 |
| XG Boost Tuned with Random Search | 0.999300 | 1.000000 | 0.987000 | 0.993400 |
| XG Boost Tuned with Grid Search | 0.999300 | 1.000000 | 0.987000 | 0.993400 |
| Random Forest Tuned with Oversampling with Random Search | 0.999900 | 1.000000 | 0.998800 | 0.999400 |
| Random Forest Tuned with Oversampling with Grid Search | 0.998300 | 0.998800 | 0.971900 | 0.985200 |
| Gradient Boost Tuned with Oversampling with Random Search | 0.994100 | 0.997600 | 0.905100 | 0.949100 |
| Gradient Boost Tuned with Oversampling with Grid Search | 0.985500 | 0.948300 | 0.818500 | 0.878600 |
| XG Boost with Oversampling with Random Search | 0.999700 | 1.000000 | 0.995200 | 0.997600 |
| XG Boost with Oversampling with Grid Search | 0.992600 | 1.000000 | 0.882300 | 0.937500 |
| Random Forest Tuned with Undersampling with Random Search | 0.946000 | 0.933900 | 0.507200 | 0.657400 |
| Random Forest Tuned with Undersampling with Grid Search | 0.947800 | 0.932700 | 0.516300 | 0.664700 |
| Gradient Boost Tuned with Undersampling with Random Search | 0.984900 | 0.954300 | 0.807700 | 0.874900 |
| Gradient Boost Tuned with Undersampling with Grid Search | 0.984500 | 0.956700 | 0.801600 | 0.872300 |
| XG Booat with Undersampling with Random Search | 0.875900 | 0.998800 | 0.308800 | 0.471800 |
| XG Booat with Undersampling with Grid Search | 0.876500 | 1.000000 | 0.309900 | 0.473100 |
Every model is showing overfitting in either Accuracy or Recall.
Gradient Boost Tuned with Undersampling model using Grid search. It has a recall of 95.43%. The values for Precision is 98.49%, F1 is 87.49%. Accuracy is showing a rate of 98.49%, which is slightly overfit but not terrible when looking at everything as a whole.
# Set the display format to show 4 decimal places
pd.options.display.float_format = '{:.4f}'.format
# Validation performance comparison
models_val_comp_df = pd.concat(
[
Random_Forest_random_val.T,
Random_Forest_grid_val.T,
GB_random_val.T,
GB_grid_val.T,
XGB_random_val.T,
XGB_grid_val. T,
Random_Forest_random_val_over.T,
Random_Forest_grid_val_over.T,
GB_random_val_over.T,
GB_grid_val_over.T,
XGB_random_val_over.T,
XGB_grid_val_over.T,
Random_Forest_random_val_under.T,
Random_Forest_grid_val_under.T,
GB_random_val_under.T,
GB_grid_val_under.T,
XGB_random_val_under.T,
XGB_grid_val_under.T
],
axis=1,
)
models_val_comp_df.columns = [
"Random Forest Tuned with Random Search",
"Random Forest Tuned with Grid Search",
"Gradient Boost Tuned with Random Search",
"Gradient Boost Tuned with Grid Search",
"XG Boost Tuned with Random Search",
"XG Boost Tuned with Grid Search",
"Random Forest Tuned with Oversampling with Random Search",
"Random Forest Tuned with Oversampling with Grid Search",
"Gradient Boost Tuned with Oversampling with Random Search",
"Gradient Boost Tuned with Oversampling with Grid Search",
"XG Boost with Oversampling with Random Search",
"XG Boost with Oversampling with Grid Search",
"Random Forest Tuned with Undersampling with Random Search",
"Random Forest Tuned with Undersampling with Grid Search",
"Gradient Boost Tuned with Undersampling with Random Search",
"Gradient Boost Tuned with Undersampling with Grid Search",
"XG Booat with Undersampling with Random Search",
"XG Booat with Undersampling with Grid Search",
]
print("Validation performance comparison:")
# Wrap column names
models_val_comp_df.transpose().round(4).style.set_table_styles([dict(selector="th", props=[('max-width', '150px')])])
Validation performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Random Forest Tuned with Random Search | 0.986000 | 0.762600 | 0.981500 | 0.858300 |
| Random Forest Tuned with Grid Search | 0.986000 | 0.777000 | 0.964300 | 0.860600 |
| Gradient Boost Tuned with Random Search | 0.973400 | 0.737400 | 0.773600 | 0.755100 |
| Gradient Boost Tuned with Grid Search | 0.972200 | 0.755400 | 0.747300 | 0.751300 |
| XG Boost Tuned with Random Search | 0.987200 | 0.805800 | 0.957300 | 0.875000 |
| XG Boost Tuned with Grid Search | 0.987200 | 0.805800 | 0.957300 | 0.875000 |
| Random Forest Tuned with Oversampling with Random Search | 0.988000 | 0.809400 | 0.969800 | 0.882400 |
| Random Forest Tuned with Oversampling with Grid Search | 0.986800 | 0.823700 | 0.930900 | 0.874000 |
| Gradient Boost Tuned with Oversampling with Random Search | 0.969600 | 0.802200 | 0.696900 | 0.745800 |
| Gradient Boost Tuned with Oversampling with Grid Search | 0.979800 | 0.838100 | 0.806200 | 0.821900 |
| XG Boost with Oversampling with Random Search | 0.986200 | 0.816500 | 0.926500 | 0.868100 |
| XG Boost with Oversampling with Grid Search | 0.983800 | 0.848900 | 0.858200 | 0.853500 |
| Random Forest Tuned with Undersampling with Random Search | 0.944000 | 0.877700 | 0.498000 | 0.635400 |
| Random Forest Tuned with Undersampling with Grid Search | 0.946800 | 0.877700 | 0.512600 | 0.647200 |
| Gradient Boost Tuned with Undersampling with Random Search | 0.978200 | 0.834500 | 0.786400 | 0.809800 |
| Gradient Boost Tuned with Undersampling with Grid Search | 0.980400 | 0.845300 | 0.810300 | 0.827500 |
| XG Booat with Undersampling with Random Search | 0.880800 | 0.906500 | 0.306600 | 0.458200 |
| XG Booat with Undersampling with Grid Search | 0.887200 | 0.906500 | 0.319000 | 0.471900 |
XGBoost with Undersampling modul using random or grid search, however both are showing Precision of 30.66% and 31.90% for Precision which is not good at all.
The next best - that has a good recall and precision rate - is Gradient Boost Tuned with Undersampling model using Grid search. It has a recall of 84.53%. The values for Precision is 81.03%, F1 is 82.75%. Accuracy is showing a rate of 98.04%, which is slightly overfit but not terrible when looking at everything as a whole.
All the models are showing some overfitting. XGBoost with undersampling Random and Grid Search, Random Forrest with undersampling Random and Grid Search are showing the least amount of overfitting but their Precision is low which makes them not good models.
# joins the validation and test dataset together to make a DataFrame that is easier to interpret
model_train_val_perf_comp = models_train_comp_df.join(
models_val_comp_df, lsuffix="_train", rsuffix="_val"
)
# code will combine train and validation columns together later on to easier interpret data
model_train_val_perf_comp.transpose()
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Random Forest Tuned with Random Search_train | 0.9949 | 0.9087 | 1.0000 | 0.9521 |
| Random Forest Tuned with Grid Search_train | 0.9903 | 0.8269 | 0.9985 | 0.9047 |
| Gradient Boost Tuned with Random Search_train | 0.9927 | 0.9087 | 0.9570 | 0.9322 |
| Gradient Boost Tuned with Grid Search_train | 0.9945 | 0.9147 | 0.9845 | 0.9483 |
| XG Boost Tuned with Random Search_train | 0.9993 | 1.0000 | 0.9870 | 0.9934 |
| XG Boost Tuned with Grid Search_train | 0.9993 | 1.0000 | 0.9870 | 0.9934 |
| Random Forest Tuned with Oversampling with Random Search_train | 0.9999 | 1.0000 | 0.9988 | 0.9994 |
| Random Forest Tuned with Oversampling with Grid Search_train | 0.9985 | 0.9988 | 0.9754 | 0.9869 |
| Gradient Boost Tuned with Oversampling with Random Search_train | 0.9941 | 0.9976 | 0.9051 | 0.9491 |
| Gradient Boost Tuned with Oversampling with Grid Search_train | 0.9855 | 0.9483 | 0.8185 | 0.8786 |
| XG Boost with Oversampling with Random Search_train | 0.9997 | 1.0000 | 0.9952 | 0.9976 |
| XG Boost with Oversampling with Grid Search_train | 0.9926 | 1.0000 | 0.8823 | 0.9375 |
| Random Forest Tuned with Undersampling with Random Search_train | 0.9460 | 0.9339 | 0.5072 | 0.6574 |
| Random Forest Tuned with Undersampling with Grid Search_train | 0.9478 | 0.9327 | 0.5163 | 0.6647 |
| Gradient Boost Tuned with Undersampling with Random Search_train | 0.9849 | 0.9543 | 0.8077 | 0.8749 |
| Gradient Boost Tuned with Undersampling with Grid Search_train | 0.9845 | 0.9567 | 0.8016 | 0.8723 |
| XG Booat with Undersampling with Random Search_train | 0.8759 | 0.9988 | 0.3088 | 0.4718 |
| XG Booat with Undersampling with Grid Search_train | 0.8765 | 1.0000 | 0.3099 | 0.4731 |
| Random Forest Tuned with Random Search_val | 0.9860 | 0.7626 | 0.9815 | 0.8583 |
| Random Forest Tuned with Grid Search_val | 0.9860 | 0.7770 | 0.9643 | 0.8606 |
| Gradient Boost Tuned with Random Search_val | 0.9734 | 0.7374 | 0.7736 | 0.7551 |
| Gradient Boost Tuned with Grid Search_val | 0.9722 | 0.7554 | 0.7473 | 0.7513 |
| XG Boost Tuned with Random Search_val | 0.9872 | 0.8058 | 0.9573 | 0.8750 |
| XG Boost Tuned with Grid Search_val | 0.9872 | 0.8058 | 0.9573 | 0.8750 |
| Random Forest Tuned with Oversampling with Random Search_val | 0.9880 | 0.8094 | 0.9698 | 0.8824 |
| Random Forest Tuned with Oversampling with Grid Search_val | 0.9882 | 0.8345 | 0.9469 | 0.8872 |
| Gradient Boost Tuned with Oversampling with Random Search_val | 0.9696 | 0.8022 | 0.6969 | 0.7458 |
| Gradient Boost Tuned with Oversampling with Grid Search_val | 0.9798 | 0.8381 | 0.8062 | 0.8219 |
| XG Boost with Oversampling with Random Search_val | 0.9862 | 0.8165 | 0.9265 | 0.8681 |
| XG Boost with Oversampling with Grid Search_val | 0.9838 | 0.8489 | 0.8582 | 0.8535 |
| Random Forest Tuned with Undersampling with Random Search_val | 0.9440 | 0.8777 | 0.4980 | 0.6354 |
| Random Forest Tuned with Undersampling with Grid Search_val | 0.9468 | 0.8777 | 0.5126 | 0.6472 |
| Gradient Boost Tuned with Undersampling with Random Search_val | 0.9782 | 0.8345 | 0.7864 | 0.8098 |
| Gradient Boost Tuned with Undersampling with Grid Search_val | 0.9804 | 0.8453 | 0.8103 | 0.8275 |
| XG Booat with Undersampling with Random Search_val | 0.8808 | 0.9065 | 0.3066 | 0.4582 |
| XG Booat with Undersampling with Grid Search_val | 0.8872 | 0.9065 | 0.3190 | 0.4719 |
Observations:
Estimated costs:
Maintenance cost = TP(Repair cost) + FN(Replacement cost) + FP*(Inspection cost) where,
Calculate total service cost
The value of this ratio will range between 0 and 1. The ratio will be 1 only when the maintenance cost of the model equals the minimum possible maintenance cost..
def calculate_total_cost(model, predictors, targets):
"""
Calculates the total cost of each model based on the following:
Replacement cost = $40,000
Repair cost = $15,000
Inspection cost = $5,000
model: model used to calculate generator failure
predictors: independant variables
targets: dependant variable
"""
# creates a flattened confusion matrix
cm_flat = confusion_matrix(targets, model.predict(predictors)).flatten()
Inspection_Cost = cm_flat[1] * 5000 # calculates total cost for False Positives
Replacement_Cost = cm_flat[2] * 40000 # calculates total cost for False Negatives
Repair_Cost = cm_flat[3] * 20000 # calculates total cost for True Positives
# sums up total service cost of False Positives, False Negatives, and True Positives
total_cost = Repair_Cost + Replacement_Cost + Inspection_Cost
return total_cost
# creates DataFrame containing the total estomated service cost for each tuned predicitve model
df_total_cost_per_model = pd.DataFrame(
[
calculate_total_cost(Random_Forest_tuned, X_val, y_val),
calculate_total_cost(Random_Forest_tuned_grid, X_val, y_val),
calculate_total_cost(GB_tuned, X_val, y_val),
calculate_total_cost(GB_tuned_grid, X_val, y_val),
calculate_total_cost(XGB_tuned, X_val, y_val),
calculate_total_cost(XGB_tuned_grid, X_val, y_val),
calculate_total_cost(Random_Forest_tuned_over, X_val, y_val),
calculate_total_cost(Random_Forest_tuned_over_grid, X_val, y_val),
calculate_total_cost(GB_tuned_over, X_val, y_val),
calculate_total_cost(GB_tuned_over_grid, X_val, y_val),
calculate_total_cost(XGB_tuned_over, X_val, y_val),
calculate_total_cost(XGB_tuned_over_grid, X_val, y_val),
calculate_total_cost(Random_Forest_tuned_under, X_val, y_val),
calculate_total_cost(Random_Forest_tuned_under_grid, X_val, y_val),
calculate_total_cost(GB_tuned_over, X_val, y_val),
calculate_total_cost(GB_tuned_over_grid, X_val, y_val),
calculate_total_cost(XGB_tuned_under, X_val, y_val),
calculate_total_cost(XGB_tuned_under_grid, X_val, y_val),
],
index=[
"Random Forest Tuned with Random Search",
"Random Forest Tuned with Grid Search",
"Gradient Boost Tuned with Random Search",
"Gradient Boost Tuned with Grid Search",
"XG Boost Tuned with Random Search",
"XG Boost Tuned with Grid Search",
"Random Forest Tuned with Oversampling with Random Search",
"Random Forest Tuned with Oversampling with Grid Search",
"Gradient Boost Tuned with Oversampling with Random Search",
"Gradient Boost Tuned with Oversampling with Grid Search",
"XG Boost with Oversampling with Random Search",
"XG Boost with Oversampling with Grid Search",
"Random Forest Tuned with Undersampling with Random Search",
"Random Forest Tuned with Undersampling with Grid Search",
"Gradient Boost Tuned with Undersampling with Random Search",
"Gradient Boost Tuned with Undersampling with Grid Search",
"XG Booat with Undersampling with Random Search",
"XG Booat with Undersampling with Grid Search",
], # sets index names
columns=["Total Service Cost"], # sets column names
)
df_total_cost_per_model
| Total Service Cost | |
|---|---|
| Random Forest Tuned with Random Search | 6900000 |
| Random Forest Tuned with Grid Search | 6840000 |
| Gradient Boost Tuned with Random Search | 7320000 |
| Gradient Boost Tuned with Grid Search | 7275000 |
| XG Boost Tuned with Random Search | 6690000 |
| XG Boost Tuned with Grid Search | 6690000 |
| Random Forest Tuned with Oversampling with Random Search | 6655000 |
| Random Forest Tuned with Oversampling with Grid Search | 6625000 |
| Gradient Boost Tuned with Oversampling with Random Search | 7145000 |
| Gradient Boost Tuned with Oversampling with Grid Search | 6740000 |
| XG Boost with Oversampling with Random Search | 6670000 |
| XG Boost with Oversampling with Grid Search | 6595000 |
| Random Forest Tuned with Undersampling with Random Search | 7470000 |
| Random Forest Tuned with Undersampling with Grid Search | 7400000 |
| Gradient Boost Tuned with Undersampling with Random Search | 7145000 |
| Gradient Boost Tuned with Undersampling with Grid Search | 6740000 |
| XG Booat with Undersampling with Random Search | 8930000 |
| XG Booat with Undersampling with Grid Search | 8770000 |
Based on cost:
Best Recommended Model:
Even though it does not have the best Recall or Precision rate, XGBoost with Oversampling with Grid Search is the best model overall.
2nd Best Model: Random Forest with Oversampling with Grid Search is the best model overall.
Testing XG Boost with Oversampling with Grid Search (best total cost), Random Forest with Oversampling Grid Search (2nd best total cost), Random Forest Tuning with Random Search (best Precision), and Gradient Boost Tuned with Undersampling with Grid Search (best Recall) before making a final recommendation.
# Calculating different metrics on the test set
xgb_over_random_test = model_performance_classification_sklearn(XGB_tuned_over_grid, X_test, y_test)
print("Test performance:")
xgb_over_random_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9590 | 0.8723 | 0.5928 | 0.7059 |
confusion_matrix_sklearn(XGB_tuned_over_grid, X_test, y_test)
print(
"Total service cost for tuned XGBoost Tuned with Oversampling with Grid Search on test data: ${:,}".format(
calculate_total_cost(XGB_tuned_over_grid, X_test, y_test)
),
)
Total service cost for tuned XGBoost Tuned with Oversampling with Grid Search on test data: $7,205,000
feature_names = X.columns
importances = XGB_tuned_over_grid.feature_importances_
indices = np.argsort(importances)
# creates plot showing model feature importance
plt.figure(figsize=(15, 20)) # sets figure size
plt.barh(
y=range(len(indices)), width=importances[indices], color="teal"
) # creates bars for plot
plt.title("Feature Importances") # sets title
plt.yticks(
ticks=range(len(feature_names)), labels=[feature_names[i] for i in indices]
) # sets y ticks as individual features
plt.show()
Recall 87.23% is good but Precision 59.28% is low. Estimated cost is $7,205,000.
# Calculating different metrics on the test set
random_forest_over_grid_test = model_performance_classification_sklearn(Random_Forest_tuned_over_grid, X_test, y_test)
print("Test performance:")
random_forest_over_grid_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9752 | 0.8085 | 0.7651 | 0.7862 |
confusion_matrix_sklearn(Random_Forest_tuned_over_grid, X_test, y_test)
print(
"Total service cost for tuned Random Forest Tuned with Oversampling with Grid Search on test data: ${:,}".format(
calculate_total_cost(Random_Forest_tuned_over_grid, X_test, y_test)
),
)
Total service cost for tuned Random Forest Tuned with Oversampling with Grid Search on test data: $7,070,000
feature_names = X.columns
importances = Random_Forest_tuned_over_grid.feature_importances_
indices = np.argsort(importances)
# creates plot showing model feature importance
plt.figure(figsize=(15, 20)) # sets figure size
plt.barh(
y=range(len(indices)), width=importances[indices], color="teal"
) # creates bars for plot
plt.title("Feature Importances") # sets title
plt.yticks(
ticks=range(len(feature_names)), labels=[feature_names[i] for i in indices]
) # sets y ticks as individual features
plt.show()
Recall 80.85% is good but Precision 76.51% is ok. Estimated cost is $7,070,000.
# Calculating different metrics on the test set
random_forest_random_test = model_performance_classification_sklearn(Random_Forest_tuned, X_test, y_test)
print("Test performance:")
random_forest_random_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9834 | 0.7943 | 0.8996 | 0.8437 |
print(
"Total service cost for tuned Random Forest Tuned on test data: ${:,}".format(
calculate_total_cost(Random_Forest_tuned, X_test, y_test)
),
)
Total service cost for tuned Random Forest Tuned on test data: $6,925,000
feature_names = X.columns
importances = Random_Forest_tuned.feature_importances_
indices = np.argsort(importances)
# creates plot showing model feature importance
plt.figure(figsize=(15, 20)) # sets figure size
plt.barh(
y=range(len(indices)), width=importances[indices], color="teal"
) # creates bars for plot
plt.title("Feature Importances") # sets title
plt.yticks(
ticks=range(len(feature_names)), labels=[feature_names[i] for i in indices]
) # sets y ticks as individual features
plt.show()
Recall 79.43% is good but Precision 89.96% is low. Estimated cost is $6,925,000.
# Calculating different metrics on the test set
GB_under_grid_test = model_performance_classification_sklearn(GB_tuned_under_grid, X_test, y_test)
print("Test performance:")
GB_under_grid_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9472 | 0.8333 | 0.5199 | 0.6403 |
print(
"Total service cost for tuned Gradient Boost Tuned with Undersampling with Grid Search on test data: ${:,}".format(
calculate_total_cost(GB_tuned_under_grid, X_test, y_test)
),
)
Total service cost for tuned Gradient Boost Tuned with Undersampling with Grid Search on test data: $7,665,000
feature_names = X.columns
importances = GB_tuned_under_grid.feature_importances_
indices = np.argsort(importances)
# creates plot showing model feature importance
plt.figure(figsize=(15, 20)) # sets figure size
plt.barh(
y=range(len(indices)), width=importances[indices], color="teal"
) # creates bars for plot
plt.title("Feature Importances") # sets title
plt.yticks(
ticks=range(len(feature_names)), labels=[feature_names[i] for i in indices]
) # sets y ticks as individual features
plt.show()
Recall 83.33% is good but Precision 51.99% is low. Estimated cost is $7,665,000.
Observations:
Best Model appears to be the Random Forest Tuned with Random Search model.
Based on test data:
The feature importances showed that sensor V18 by far is the most important sensor when determining the failure of the generator.
# creates list of features to transform
col_list = [
"V1",
"V2",
"V3",
"V4",
"V5",
"V6",
"V7",
"V8",
"V9",
"V10",
"V11",
"V12",
"V13",
"V14",
"V15",
"V16",
"V17",
"V18",
"V19",
"V20",
"V21",
"V22",
"V23",
"V24",
"V25",
"V26",
"V27",
"V28",
"V29",
"V30",
"V31",
"V32",
"V33",
"V34",
"V35",
"V36",
"V37",
"V38",
"V39",
"V40",
]
MinMax_transformer = make_pipeline(MinMaxScaler())
# sets transformer as KNNImputer for the ColumnTransformer
KNN_transformer = make_pipeline(KNNImputer(n_neighbors=10))
# sets ColumnTransfomer as preprocessor to impute model features using KNN
preprocessor = ColumnTransformer(
transformers=[
("scaler", MinMax_transformer, col_list),
("imputer", KNN_transformer, col_list),
],
remainder="passthrough",
)
X_train_pipeline = X_train
y_train_pipeline = y_train
X_test_pipeline = X_test
y_test_pipeline = y_test
# sets pipeline as name model
model = Pipeline(
steps=[
("pre", preprocessor), # Imputes missing values with KNNImputer
(
"Random Forest",
RandomForestClassifier(
n_estimators= 300,
min_samples_leaf= 1,
max_samples= 0.6,
max_features= 'sqrt',
), # the model being used in the pipeline
),
]
)
model.fit(X_train_pipeline, y_train_pipeline) # fits the model to the training data
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('scaler',
Pipeline(steps=[('minmaxscaler',
MinMaxScaler())]),
['V1', 'V2', 'V3', 'V4', 'V5',
'V6', 'V7', 'V8', 'V9',
'V10', 'V11', 'V12', 'V13',
'V14', 'V15', 'V16', 'V17',
'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29',
'V30', ...]),
('imputer',
Pipeline(steps=[('knnimputer',
KNNImputer(n_neighbors=10))]),
['V1', 'V2', 'V3', 'V4', 'V5',
'V6', 'V7', 'V8', 'V9',
'V10', 'V11', 'V12', 'V13',
'V14', 'V15', 'V16', 'V17',
'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29',
'V30', ...])])),
('Random Forest',
RandomForestClassifier(max_samples=0.6, n_estimators=300))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('scaler',
Pipeline(steps=[('minmaxscaler',
MinMaxScaler())]),
['V1', 'V2', 'V3', 'V4', 'V5',
'V6', 'V7', 'V8', 'V9',
'V10', 'V11', 'V12', 'V13',
'V14', 'V15', 'V16', 'V17',
'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29',
'V30', ...]),
('imputer',
Pipeline(steps=[('knnimputer',
KNNImputer(n_neighbors=10))]),
['V1', 'V2', 'V3', 'V4', 'V5',
'V6', 'V7', 'V8', 'V9',
'V10', 'V11', 'V12', 'V13',
'V14', 'V15', 'V16', 'V17',
'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29',
'V30', ...])])),
('Random Forest',
RandomForestClassifier(max_samples=0.6, n_estimators=300))])ColumnTransformer(remainder='passthrough',
transformers=[('scaler',
Pipeline(steps=[('minmaxscaler',
MinMaxScaler())]),
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7',
'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26',
'V27', 'V28', 'V29', 'V30', ...]),
('imputer',
Pipeline(steps=[('knnimputer',
KNNImputer(n_neighbors=10))]),
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7',
'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26',
'V27', 'V28', 'V29', 'V30', ...])])['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40']
MinMaxScaler()
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40']
KNNImputer(n_neighbors=10)
[]
passthrough
RandomForestClassifier(max_samples=0.6, n_estimators=300)
model.predict(X_test_pipeline) # predicts the model with the testing data
array([0, 0, 0, ..., 0, 0, 0])
model_performance_classification_sklearn(model, X_test_pipeline, y_test_pipeline)
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9796 | 0.7766 | 0.8488 | 0.8111 |
confusion_matrix_sklearn(model, X_test_pipeline, y_test_pipeline)
print(
"Total service cost for tuned XGBoost on test data: ${:,}".format(
calculate_total_cost(model, X_test_pipeline, y_test_pipeline)
),
)
Total service cost for tuned XGBoost on test data: $7,095,000
Estimated production total service cost is 7,095,000.
The pipeline is showing Accuracy of 97.96%, Recall of 77.66%, Precision of 84.88%, and F1 81.11%.
After developing 18 different machine learning models, we selected the final model, Random Forest Tuned with Random Search. We addressed the target class imbalance (few 'failures' and many 'no failures') and optimized the model's performance using hyperparameter tuning and cross-validation. By testing with original, undersampled, and oversampled data, and employing a pipeline, we developed the most effective model possible with the available data.
Random Forest Tuned with Random Search.
The data showed that the generator did not fail in 94.36% of cases and failed in 5.64% of cases.
Top 5 sensors to watch:
%%shell
jupyter nbconvert --to html //'/content/drive/MyDrive/Python_Course/Project_6/MT_Project_LearnerNotebook_FullCode_Updated.ipynb'
[NbConvertApp] Converting notebook ///content/drive/MyDrive/Python_Course/Project_6/MT_Project_LearnerNotebook_FullCode_Updated.ipynb to html [NbConvertApp] Writing 4389438 bytes to /content/drive/MyDrive/Python_Course/Project_6/MT_Project_LearnerNotebook_FullCode_Updated.html